Longest Sorted Sequence Algorithm for Parallel Text Alignment

نویسندگان

  • Tiago Ildefonso
  • José Gabriel Pereira Lopes
چکیده

This paper describes a language independent method for aligning parallel texts (texts that are translations of each other, or of a common source text), statistically supported. This new approach is inspired on previous work by Ribeiro et al (2000). The application of the second statistical filter, proposed by Ribeiro et al, based on Confidence Bands (CB), is substituted by the application of the Longest Sorted Sequence algorithm (LSSA). LSSA is described in this paper. As a result, 35% decrease in processing time and 18% increase in the number of aligned segments was obtained, for Portuguese­French alignments. Similar results were obtained regarding Portuguese­English alignments. Both methods are compared and evaluated, over a large parallel corpus made up of Portuguese, English and French parallel texts (approximately 250Mb of text per language).

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Parallel Non-Alignment Based Approach to Efficient Sequence Comparison using Longest Common Subsequences

Biological sequence comparison programs have revolutionized the practice of biochemistry, molecular and evolutionary biology. Pairwise comparison is the method of choice for many computational tools developed to analyze the deluge of genetic sequence data. Unfortunately, a comprehensive study of the strengths and weaknesses of different alignment algorithms applied to different biological probl...

متن کامل

An Efficient Parallel Algorithm for Longest Common Subsequence Problem on GPUs

Sequence alignment is an important problem in computational biology and finding the longest common subsequence (LCS) of multiple biological sequences is an essential and effective technique in sequence alignment. A major computational approach for solving the LCS problem is dynamic programming. Several dynamic programming methods have been proposed to have reduced time and space complexity. As ...

متن کامل

PaMSA: A Parallel Algorithm for the Global Alignment of Multiple Protein Sequences

Multiple sequence alignment (MSA) is a well-known problem in bioinformatics whose main goal is the identification of evolutionary, structural or functional similarities in a set of three or more related genes or proteins. We present a parallel approach for the global alignment of multiple protein sequences that combines dynamic programming, heuristics, and parallel programming techniques in an ...

متن کامل

An Application of the ABS LX Algorithm to Multiple Sequence Alignment

We present an application of ABS algorithms for multiple sequence alignment (MSA). The Markov decision process (MDP) based model leads to a linear programming problem (LPP), whose solution is linked to a suggested alignment. The important features of our work include the facility of alignment of multiple sequences simultaneously and no limit for the length of the sequences. Our goal here is to ...

متن کامل

BitPAl: a bit-parallel, general integer-scoring sequence alignment algorithm

MOTIVATION Mapping of high-throughput sequencing data and other bulk sequence comparison applications have motivated a search for high-efficiency sequence alignment algorithms. The bit-parallel approach represents individual cells in an alignment scoring matrix as bits in computer words and emulates the calculation of scores by a series of logic operations composed of AND, OR, XOR, complement, ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005